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Abstract. The use of graphics processing units for scientific compu- 
tations is an emerging strategy that can significantly speed up various 
different algorithms. In this review, we discuss advances made in the field 
y-H ' of computational physics, focusing on classical molecular dynamics, and 

^ ' on quantum simulations for electronic structure calculations using the 

density functional theory, wave function techniques, and quantum field 
theory. 
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1 Introduction 

The graphics processing unit (GPU) has been an essential part of personal com- 
puters for decades. Their role became much more important in the 90's when 
the era of 3D graphics in gaming started. One of the hallmarks of this is the 
ultimately violent first-person shooting game called DOOM by the id Software 
company, released in 1993. Wandering around the halls of slaughter, it was hard 
to imagine these games leading to any respectable science. However, now twenty 
years after the release of DOOM, the gaming industry has an enormous market, 
and the continuous need for more realistic visualizations has lead to a situation 
where the CPUs have, indeed, an enormous computational power, much higher 
than the central processing units (CPUs) , at least in terms of theoretical peak 
performance. 

The games started to have real 3D models and hardware acceleration in the 
mid 90's, but an important turning point for scientific use of GPU's for comput- 
ing was around the first years of this millennium [T] , when the programmability 
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of CPU's was introduced. Combined with the continued increase in computa- 
tional power as shown in Fig. [l] the CPUs are nowadays a serious platform 
for general purpose computing. Also, the memory bandwith in CPUs is very 
impressive. The three main vendors for CPUs, Intel, NVIDIA, and ATI/AMD, 
are all actively developing computing on CPUs. In addition, the Cell processor 
architecture by IBM, Sony, and Toshiba is also relevant for scientific activities 
and often grouped with the GPU efforts. At the moment, none of the technolo- 
gies listed above dominate the field, but NVIDIA with its CUDA programming 
environment is perhaps the market leader at the moment. 




Fig. 1. Floating point operations (FLOPS) per second for CPUs and CPUs from 
NVIDIA and Intel Corporations, figure taken from 12]. The processing power of the 
currently best GPU hardware by the AMD Corporation is comparable to the NVIDIA, 
being around 2600 GFLOPS/s. 



1.1 GPU as a computational platform 

At this point we have hopefully convinced you that the GPU is a powerful 
architecture also for general computing, but what makes GPU different from 
the current multi-core CPUs? To understand this, we can start with traditional 
graphics processing, where hardware vendors have tried to maximize the speed 
at which the pixels on the screen are calculated. These pixels are independent 
primitives that can be processed in parallel, and the number of pixels of computer 
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displays has also increased over the years from the original DOOM resolution 
of 320 x 200 corresponding to 64000 pixels to millions. The most efficient way 
to process these primitives is to have a very large number of arithmetic logical 
units (ALUs) that are able to perform a high number of operations for each 
video frame. The processing is very data-parallel, and one can view this so that 
the same arithmetic operation is performed in parallel for each primitive to be 
processed. Furthermore, as the operation is the same for each primitive, there is 
no need for very sophisticated flow control in GPU and more transistors can be 
used for arithmetics, resulting in an enormously efficient hardware for performing 
parallel computing that can be classified as "single instruction, multiple data" 



Now, for general computing on the GPU, the primitives processed are not 
anymore the pixels on the video stream, but can range from matrix elements 
in linear algebra to physics related cases where the primitives can be particle 
coordinates in classical molecular dynamics or quantum field values. From the 
traditional graphics processing we can understand that the general computing 
would be efficient when we have a situation where the same arithmetics needs to 
be performed for a large data set. It is clear that not all problems or algorithms 
have this structure, but there are luckily many cases where this is the case, and 
the list of successful examples is long. 

However, there are also limitations on GPU computing. First of all, when 
porting a CPU solution for a given problem to GPU, one might need to change 
the algorithm to suit the SIMD approach. Secondly, the communication from the 
host part of the computer to the GPU part is limited by the speed of the PCIe 
bus coupling the GPU and the host. In practice, this means that one needs to 
perform a serious amount of computing on the GPU between the data transfers 
before the GPU can actually speed up the overall computation. Of course, there 
are also cases where the computation as a whole is done on GPU, but these cases 
suffer from the somewhat smaller serial processing power of the GPU. 

2 Molecular dynamics 

Molecular dynamics (MD) refers to the type of simulation where the behaviour of 
a complex system is calculated by integrating the equation of motion of its com- 
ponents within any given model, and its goal is to observe how some ensemble- 
averaged properties of the system originate from the detailed configuration of 
its constituent particles (figure [2]). 

In its classical formulation, the dynamics of a system of particles is described 
by their Newtonian equations: 



where rrii is the particle's mass, Xi its position, and F^j is the interaction between 
the z-th and j-th particles as provided by the model chosen for the studied 
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Fig. 2. Schematic representation of the atomistic model of a macroscopic system. 

system. These second order differential equations are then discretised in the 
time domain, and integrated step by step until one is satisfied with the results. 

The principles behind MD are so simple and general that since its first ap- 
pearance in the 70s, it has been applied to a wide range of systems, at very 
different scales. Often MD is applied in the field of biophysics, since it happens 
to be the dominant theoretical tool of investigation, where structural changes in 
proteins [3141516] and lipid bilayers |7I8) interacting with drugs can be studied, 
ultimately providing a better understanding of drug delivery mechanisms. 

At larger scales, one of the most famous example is known as the Millenium 
Simulation, where the dynamics of the mass distribution of the universe at the 
age of 380000 years, was simulated up to the present day [9], showing the for- 
mation age and process of cosmic structures (galaxies, black holes and quasars), 
greatly improving our understanding of cosmological models and providing a 
theoretical comparison to satellite measurements. 

Despite the simplicity and elegance of its formulation, MD is not a compu- 
tationally easy task and often requires special infrastructure. The main issue is 
usually the evaluation of all the interactions F^ that is the most time consum- 
ing procedure of any MD calculation for large systems. Moreover, the processes 
under study might have long characteristic time scales, requiring longer simula- 
tion time and larger data storage; classical dynamics is chaotic, i.e. the outcome 
is affected by the initial conditions, and since these are in principle unknown 
and chosen at random, some particular processes of interest might not occur 
just because of the specific choice, and the simulation should be repeated several 
times. For these reasons it is important to optimise the evaluation of the forces 
as much as possible. 

Fortunately, MD is a very suitable problem for parallel computation, as all 
the interactions F+j are independent of each other and their evaluation can be 
distributed over different computing units; custom hardware architectures ded- 
icated to a specific MD simulation were also developed [10111112] . and despite 
the good performance achieved, their excessive cost and lack of general applica- 
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bility hampered their spread through scientific communities, or even beyond the 
particular scope they were initially intended for. 

Recently, GPUs attracted a lot of interest from the scientific community be- 
cause they feature a massively parallel architecture with performance on the level 
of small computer clusters, at the cost and power consumption of conventional 
commodity hardware. Even though they were originally designed for real-time 
3D rendering, the computer games market drove GPU manufacturers to fit ever 
more processing power on their devices, to the point where it became advan- 
tageous to calculate the game physics (dynamics and collisions) on the device 
itself [13114] . 

An early attempt to implement MD on GPU was proposed in 2004 [15] and 
showed promising performance; at that time, general purpose GPU computing 
was not yet a well established framework and the N-body problem had to be 
formulated as a rendering task: a shader program computed each pair interac- 
tion F ij and stored it as the pixel color values (RBG) in a N x N texture, then 
another shader would simply sum these values row-wise to obtain the total force 
on each particle and finally integrate their velocities and positions. The method 
is called all-pairs calculation, and as the name might suggest it is quite expensive 
as it requires 0(N 2 ) forces evaluations. The proposed implementation was in no 
way optimal since the measured performance was about a tenth of the nominal 
value of the device, and immediately revealed one of the main issues of the archi- 
tecture that still persists nowadays: GPUs are very fast at computing data, but 
extremely slow at handling them. The reason for the bad performance was in fact 
the large amount of memory read instructions compared to the amount of com- 
putation effectively performed on the fetched data, but despite this limitation, 
the code still outperformed a CPU by a factor of 8 because every interaction 
was computed concurrently. A wide overview of optimisation strategies to get 
around the memory latency issues can be found in Ref. [16J . while, for the less 
eager to get their hands dirty, a review of available MD software packages is 
included in Ref. [17]. 

In the current GPU programming model, the computation is distributed in 
different threads, grouped together as blocks in a grid fashion, and they are 
allowed to share data and synchronise throughout the same block; the hardware 
also offers one or two levels of cache to enhance data reuse thus reducing the 
amount of memory accesses, without harassing the programmer with manual pre- 
fetching. A more recent implementation of the all-pair calculation [18] exploiting 
the full power of the GPU can achieve performance close to the nominal values, 
comparable to several CPU nodes. 

The present, more mature, GPGPU framework allows for more elaborate 
programs to fit the device, enabling the implementation of computational tricks 
developed during the early days of MD [19], in order to integrate N-body dy- 
namics accurately with much better scaling than 0{N 2 ). For example, in many 
cases the inter-particle forces are short range, and it would be unnecessary to 
evaluate every single interaction Fij since quite many of them would be close 
to zero and just be neglected. It is good practice to build lists of neighbours for 
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Fig. 3. Illustration of different space partition methods. In dense systems (a) a regular 
grid is preferred, and neighbour particles can be searched only in a few adjacent voxels. 
Sparse system (b) are better described by hierarchical trees, excluding empty regions 
from the computation. 



each particle in order to speed up the calculation of forces: this also takes an 
0{N 2 ) operation, although the list is usually recalculated every 100-1000 time- 
steps, depending on the average mobility of the particles. Another way to exclude 
distant pairs from the calculation is to divide the simulation box in voxels and 
search for a partcle's neighbours only within the adjacent voxels (figure[3K). Per- 
formance can be further improved by sorting particles depending on the index 
of the voxel they belong, making neighbouring particles in space, to a degree, 
close in memory thus increasing coalescence and cache hit rate; such a task can 
be done with radix count sort [20121122] in O(N) with excellent performance, 
and it was shown to be the winning strategy [23.. 

Unfortunately, most often the inter-particle interactions are not exclusively 
short range and can be significant even at larger distances (electrostatic and 
gravitational forces). Therefore introducing an interaction cut-off leads to the 
wrong dynamics. For dense systems, such as bulk crystals or liquids, the electro- 
static interaction length largely exceeds the size of the simulation space, and in 
principle, one would have to include the contributions from several periodic im- 
ages of the system, although their sum is not always convergent. The preferred 
approach consists of calculating the electrostatic potential V(r) generated by 
the distribution of point charges p(r) from Poisson's equation: 

V 2 V(r) = p(r) (2) 

The electrostatic potential can be calculated by discretising the charge distribu- 
tion on a grid, and solve equation [2] with fast Fourier transform (FFT), which 
has O(MlogM) complexity (where M is the amount of grid points): this ap- 
proach is called particle- mesh Ewald (PME). Despite being heavily non-local, 
much work has been done to improve the FFT algorithm and make it cache 
efficient [24 25 26 l27|28j . so it is possible to achieve a 20-fold speed up over the 
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standard CPU FFTW or a 5-fold speedup when compared to highly optimised 
MKL implementation. The more recent multilevel summation method (MSM) 
[29] uses nested interpolations of progressive smoothing of the electrostatic po- 
tential on lattices with different resolutions, offering a good approximation of 
the electrostatic 0(N 2 ) problem in just O(N) operations. The advantage of this 
approach is the simplicity of its parallel implementation, requiring less memory 
communication among the nodes, thus leading to a better scaling than FFT 
calculation in PME. The GPU implementation of this method gave a 25-fold 
speedup over the single CPU j3U] . Equation [5] can also be translated into a lin- 
ear algebra problem using finite differences, and solved iteratively on multi-grids 
|31l32j in theoretically 0(M) operations: even though the method initially re- 
quires several iterations to converge, the solution does not change much in one 
MD step and can be used as a starting point in the following step, which in turn 
will take much fewer iterations. 

On the other hand, for sparse systems such as stars in cosmological simu- 
lations, decomposing the computational domain in regular boxes can be quite 
harmful because most of the voxels will be empty and some computing power 
and memory is wasted there. The optimal way to deal with such situation is 
to subdivide space hierarchically with an octree [33) (figure [3b), where only the 
subregions containing particles are further divided and stored. Finding neigh- 
bour particles can be done via traversal of the tree in O(NlogN) operations. 
Octrees are conventionally implemented on the CPU as dynamical data struc- 
tures where every node contains reference pointers to its parent and children, and 
possibly information regarding its position and content, although this method 
is not particularly GPU friendly since data is scattered in memory as well as 
in the simulation space. In GPU implementations, the non-empty nodes are 
stored as consecutive elements in an array or texture, and include the indices of 
children nodes |34) and proved to give good acceleration in solving the N-body 
problem 18 35 36 . Long range interactions are then calculated explicitly for the 
near neighbours, while the fast multipole method (FMM) [37)38] can be used to 
evaluate contributions from distant particles. The advantage of representing the 
system with an octree becomes now more evident: there exists a tree node con- 
taining a collection of distant particles, which can be treated as a single multipole 
leading to an overall complexity 0(N). The mathematics required by FMM is 
quite intensive to evaluate, nevertheless the algorithms involved have been de- 
veloped and extensively optimised for GPU architecture [39140141] . achieving 
excellent parallel performance even on large clusters |42[ . 

In all the examples shown here, the GPU implementation of a method out- 
performed its CPU counterpart: in many cases the speedup is only 4-5 fold when 
compared to a highly optimised CPU code, which is, in a way, a discouraging 
result. This is so because implementing an efficient GPU algorithm is quite a 
difficult task, requiring knowledge of the target hardware, and the programming 
model is not as intuitive as regular CPU: the implementation difficulty could 
be eased by developing better compilers, that check how memory is effectively 
accessed and provide higher levels of GPU optimisation on older CPU codes 
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automatically. In some cases 10 to 100 fold speedup was measured, showing that 
GPU is clearly the way to go for a defined group of problems, while it fails for 
others. This issue is quite similar to the everlasting dispute between raytracing 
and raster graphics: one can calculate explicitly photorealistic images in complex 
scenes, taking its time (CPU), while the latter resorts to every trick in the book 
to get a visually "alright" result as fast as possible (GPU). Best would be to use 
both methods to calculate what they are good for, and this sets a clear view of 
the future hardware required for scientific computing, where both simple vector- 
like processors and larger CPU cores could access the same memory resources, 
avoiding data transfer. 

3 Density-functional theory 

Density functional theory (DFT) is a popular method for ab initio electronic 
structure calculations in material physics and quantum chemistry. We use the 
following density-functional theory formulation for the ground state Kohn-Sham 
wave- functions ip n (r): 

Hip n (r) = e n ^„(r), (3) 

where H = — ^V 2 + vh (r) + v ex t(r) + v xc (r) and vjj is the Hartree potential 
defined by the Poisson equation V 2 w//(r) = — 4irp(r), v ext is the external ionic 
potential and v xc is the exchange-correlation potential. The electronic charge 
density p(r) is determined by the eigenvectors p(r) = J^i /i|V'i( r )| 2 j where the 
fi'.s are the orbital occupation numbers. 

There are several numerical approaches and approximations for solving the 
KS equation. They relate usually to to discretization of the equations and the 
treatment of the core electrons (pseudo-potential and all electron methods) . The 
most common discretizations methods in solid state physics are plane waves, 
localized orbitals, real space grids and finite elements. Generally an iterative 
procedure called self-consistent field (SCF) calculation is used to find the solution 
to the eigenproblem starting from initial guess for the charge density. 

Porting existing DFT code to GPUs generally include profiling or discovering 
with some other method the most computationally expensive parts of SCF-loop 
and reimplementing them with GPUs. Depending on the discretization meth- 
ods the known numerical bottlenecks are vector operations, matrix products, 
Fast Fourier Transforms (FFTs) and stencil operations. There are GPU versions 
of many of the standard computational libraries (like CUBLAS for BLAS and 
CUFFT for FFTW). However generally porting DFT application is generally 
not as simple as replacing the function calls with GPU equivalents since the 
resulting intermediate date gets heavily reused. 

Partial DFT implementations on GPU include calculation of the exchange- 
correlation term and evaluation of the Coulomb potential [H] in Gaus- 
sian based systems. However the first complete DFT code using double preci- 
sion arithmetic on GPUs was presented by Genovese et al. [15] , They used a 
Daubechies wavelet based code called BIGDFT [46 . The basic 3D operations 
for a wavelet based code are based on convolutions. They achieved speed-ups of 
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factor 20 for some of these operations with GPU and a factor of 6 for the whole 
hybrid code using NVIDIA Tesla S1070 cards. These speed-ups were achieved 
on a 12-node hybrid machine. 

For solid state physics plane wave (PW) basis sets are most common choice. 
The computational schemes rely heavily on linear algebra operations and fast 
Fourier transformations. Vienna ab initio Simulation Package (VASP) [47] is 
a popular code combining plane waves with projector-augmented wave (PAW) 
method. The most time consuming part of optimizing the wave-functions given 
the trial wave-functions and related routines have been ported to GPUs. For 
blocked Davinson scheme [48] and for the RMM-DIIS algorithm [49] speed-ups 
by a factor between 3 and 8 real-world examples with Fermi C2070 cards were 
achieved. Parallel scalability with 16 GPUs was similar to 16 CPUs. 

Quantum ESPRESSO [5(3 is a electronic structure code based plane wave 
basis sets and pseudo-potentials (PP). For GPU version [5T] the most compu- 
tationally parts of SCF-cycle were gradually transferred to run on GPUs. FFTs 
were accelerated by CUFFT, LAPACK by MAGMA and other routines were 
replaced by CUDA kernels. GEMM operations were replaced by parallel hybrid 
phiGEMM [52] library. For single node test systems running with NVIDIA Tesla 
C2050 speed-ups between 5.5 and 7.8 were achieved and for 32 node parallel sys- 
tem the speed-ups between 2.5 and 3.5 were observed. Wand et al. [53] did a 
GPU cluster implementation of plane wave pseudo-potential code called PEtot. 
They were able to achieve speed-ups in computationally time by factor of 6 to 
10 and parallel scalability up to 256 CPU-GPU computing units. 

GPAW [51] is a density-functional theory (DFT) electronic structure pro- 
gram package based on grid based projector augmented wave (PAW) method. 
We have used GPUs to speed up most of the computationally intensive parts 
of solving the code: solving the Poisson equation, iterative refinement of eigen- 
vectors, subspace diagonalization and orthonormalization of the wavefunctions. 
Overall we've achieved speed-ups up to 15 times on large systems and a good 
parallel scalability with up to 200 GPUs using NVIDIA Tesla M2070 cards. 

4 Quantum field theory 

Quantum field theories are currently our best models for fundamental interac- 
tions of the natural world (For a brief introduction to quantum field theories - 
or QFTs - see for example |55] or [56j and references therein). Common compu- 
tational techniques includes perturbation theory, which works well in quantum 
field theories as long as the couplings are small enough to be considered as per- 
turbations to free theory. Therefore, perturbation theory is the primary tool used 
in pure QED, weak nuclear force and high momentum-transfer QCD phenom- 
ena, but it breaks up when the coupling constant of the theory (the measure of 
the interaction strength) becomes large, such as in low-energy QCD. 

Formulating the quantum field theory on a space-time lattice provides an op- 
portunity to study the model non-perturbatively and use computer simulations 
to get results for a wide range of phenomena - it enables, for example, one to 
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compute the hadronic spectrum of QCD [57] from first principles and provides 
solutions for many vital gaps left by the perturbation theory, such as structure 
functions of composite particles [58], form- factors [59] and decay-constants [60] . 
It also enables one to study and test models for new physics, such as technicolor 
theories [61] and quantum field theories at finite temperature [62], [63]. For an 
introduction to Lattice QFT see for example [64] , [65] or [66] . 

Simulating quantum field theories using GPUs is not a completely new idea 
and early adopters even used OpenGL (graphics processing library) to program 
the GPUs to solve lattice QCD [67] - the early GPGPU-programmers needed to 
set up a program that draws two triangles that fill the output-texture of desired 
size by running a "shader program" , which does the actual computation, for 
each output-pixel and in this program the input-data could be then accessed by 
fetching pixels input texture(s) using the texture-units of the GPU. In lattice 
QFT, where one typically needs to fetch the nearest neighbor lattice-site values, 
this actually results in good performance as the texture caches and layouts of 
the GPUs have been optimized for local access-patterns for filtering purposes. 

4.1 Solving QFTs numerically 

The idea behind lattice QFT is based on the discretization of the path-integral 
solution to expectation values of time-ordered operators in quantum field the- 
ories. First one divides space-time into discrete boxes, called the lattice, and 
places the fields onto the lattice sites and onto the links between the sites, as 
shown in Fig. 3] Then one can simulate nature by creating a set of multiple field 
configurations, called an ensemble and calculate values of physical observables 
by computing ensemble averages over these states. 













































) 














4 























Fig. 4. The matter fields ^(x) live on lattice sites, whereas gauge fields U fl (x) live on 
links connecting the sites. Also depicted staples connecting to a single link variable, 
that are needed in computation of the gauge field forces. 



11 



The set of states is normally produced with the help of a Markov chain and 
in the most widely studied QFT, the lattice QCD, the chain is produced by com- 
bining a molecular dynamics algorithm together with a Metropolis acceptance 
test. Therefore the typical computational tasks in lattice QFTs are: 

1. Refresh generalized momentum variables from a heat bath (Gaussian distri- 
bution) once per trajectory. 

2. Compute generalized forces for fields for each step 

3. Integrate classical equations of motion for the fields at each stepH 

4. Perform a Metropolis acceptance test at end of trajectory in order to achieve 
the correct limiting distribution. 

In order to reach satisfying statistics, normally thousands of these trajectories 
need to be generated and each trajectory is typically composed of 10 to 100 
steps. The force calculation normally involves a matrix inversion, where the 
matrix indices run over the entire lattice. It is therefore the heaviest part of the 
computation - the matrix arises in simulations with dynamical fermions (normal 
propagating matter particles) and the simplest form for the Fermion matrix itjf] 

A x , y = [Q^Q]x,y where 

±4 

Qx,y = Sx,y - « ^ 5y+fi,x{ 1 + Jv)U^(x). (4) 

Here k is a constant related to the mass(es) of the quark(s), S x<y is the Kro- 
necker delta function (unit matrix elements), the sum goes over the space-time 
dimensions /i, 7^ are 4-by-4 constant matrices and U^{x) are the link variable- 
matrices that carry the force (gluons for example), from one lattice site to the 
neighbouring one - in normal QCD they are 3-by-3 complex matrices. 

The matrix A in the equation Ar — z, where one is solving for the vector r 
with given z, is an almost diagonal sparse matrix with predefined sparsity pattern. 
This fact makes lattice QCD ideal for parallelization, as the amount work done 
by each site is constant. The actual algorithm used in the matrix inversion is 
normally some variant of the conjugate gradient algorithm, and therefore one 
needs fast code to handle the multiplication of a fermion vector, by the fermion 
matrix. 

This procedure is the generation of the lattice configurations which form the 
ensemble. Once the set of configurations {Ui},i £ has been generated 

with the statistical weight e~ s ^ Ui \ where S[Ui] is the Euclidean action (action 
in imaginary time formulation) expectation value of an operator F[U] can be 
computed simply as 

1 N 



5 The integration is not done with respect to normal time variable, but through the 
Markov chain index- "time" . 

6 There are multiple different algorithms for simulating fermions, here we present the 
simplest one for illustrative purposes. 
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4.2 Existing GPU Solutions to Lattice QFTs 

As lattice QFTs are normally easily parallelizable they fit well into the GPU 
programming paradigm, which can be characterized as a parallel throughput 
machine. The conjugate gradient methods perform many fermion matrix vec- 
tor multiplications whose arithmetic intensity (ratio of floating point operations 
done per byte of memory fetched) is quite low, making memory bandwidth the 
normal bottleneck within a single processor. Parallelization between processors 
is done by standard MPI domain decomposition techniques. The conventional 
wisdom that this helps due to higher local volume to communication surface 
area ratio is actually flawed, as typically the GPU can handle a larger volume in 
the same amount of time, hence requiring the MPI-implementation to also take 
care of a larger surface area in the same time as with a CPU. In our experience 
GPU adoption is still in some sense in its infancy, as the network implementation 
seems to quickly become the bottleneck in the computation and the MPI im- 
plementations of running systems seem to have been tailored to meet the needs 
of the CPUs of the system. Another aspect of this is that normally the GPUs 
are coupled with highly powered CPUs in order to cater for the situation where 
the users use the GPUs in just a small part of the program and need a lots of 
sequential performance in order to try to keep the serial part of the program 
up with the parallel part. The GPU also needs a lot of concurrent threads (in 
the order of thousands) to be filled completely with work and therefore good 
performance is only achievable with relatively large local lattice-sizes. 

Typical implementations assign one GPU thread per site, which makes paral- 
lelization easy and gives the compiler quite a lot of room to find instruction-level 
parallelism, but in our experience this can result in a relatively high register 
pressure: The quantum fields living on the sites have many indices (normally 
color and Dirac indices) and are therefore vectors or matrices with up to 12 
complex numbers per field per site in the case of quark fields in normal QCD. 
Higher parallelization can be achieved by taking advantage of the vector-like 
parallelism inside a single lattice-site, but this may be challenging to implement 
in those loops where the threads within a site have to collaborate to produce a 
result, especially because GPUs impose restrictions on the memory layout of the 
fields (consecutive threads have to read consecutive memory locations in order to 
reach optimal performance [2]). In a recent paper [68] the authors solve the gauge 
fixing problem using overrelaxation techniques and they report an increase in 
performance by using multiple threads per site, although in this case the register 
pressure problem is even more pronounced and the effects of register spilling to 
LI were not studied. 

Currently, there are various groups using GPUs to do lattice QFT simula- 
tions. Standardization efforts for high precision Lattice QCD libraries are un- 
derway and the QUDA library (69] scales to hundreds of GPUs by using a local 
Schwarz-preconditioning technique, effectively eliminating all the GPU-based 
MPI communications for a significant portion of the calculation. They employ 
various optimization techniques, such as mixed-precision solvers, where parts of 
the inversion process of the fermion matrix is done at lower precision of floating 
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point arithmetic and using reduced representations of the SU3-matrices. Scaling 
to multiple GPUs can also be improved algorithmically: already a simple (almost 
standard) clover improvement |70] term in the fermion action leads to better lo- 
cality and of course improves the action of the model as well, taking the lattice 
formulation closer to the continuum limit. Domain decomposition and taking ad- 
vantage of restricted additive Schwarz (RAS) preconditioning using GPUs was 
already studied in 2010 in [71], where the authors get best performance on a 32 4 
lattice with vanishing overlap between the preconditioning domains and three 
complete RAS-iterations each containing just five iterations to solve the local 
system of 4 x 32 3 sites. It should be noted though that the hardware they used 
is already old, so optimal parameters with up-to-date components could slightly 
differ. 

Very soon after starting to work with GPUs on Lattice QFTs one notices 
the effects of Amdahl's law which just points out the fact that there is an upper 
bound for whole program performance improvement related to optimizing just 
a portion of the program. It is quite possible that the fermion matrix inversion 
takes 90% of the total compute time, but making this portion of the code run 10 
times faster reveals something odd: now we are spending half of our time com- 
puting forces and doing auxiliary computations and if we optimize this portion 
of the code as well, we improve our performance by a factor of almost two again 
- therefore optimizing only the matrix inversion gives us a mere fivefold perfor- 
mance improvement instead of the promised order of magnitude improvement. 
Authors of [72] implemented practically the entire HMC-trajectory on the GPU 
to fight Amdahl's law and recent work [73] on the QDP++ library implements 
Just-in- Time compilation to create GPU kernels on the fly to accommodate any 
non-performance critical operation over the entire lattice. 

Work outside of standard Lattice QCD using GPUs include implementation 
of the Neuberger-Dirac overlap operator [74] , which provides chiral symmetry at 
the expense of a non-local action. Another group uses the Arnoldi algorithm on 
a multi-GPU cluster to solve the overlap operator [75] and show scaling up to 
32 GPUs. Quenched SU2 [75] and later quenched SU2, SU3 and generic SU(iV c ) 
simulations using GPUs are described in [77] and even compact U(l) Polyakov 
loops using GPUs is studied in [78]. Scalar field theory - the so-called A0 4 - 
model - using AMD GPUs is studied in [79]. TWQCD collaboration has also 
implemented almost the entire HMC trajectory computation with dynamical 
Optimal Domain Wall Fermions, which improve chiral symmetry of the action 

While most of the groups use exclusively NVIDIA's CUDA-implementation 
[2], which offers good reliability, flexibility and stability, there are also some 
groups using the OpenCL standard [81]. A recent study [82] showed better 
performance on AMD GPUs than on NVIDIA ones using OpenCl, although 
it should be noted that the NVIDIA GPUs were consumer variants with re- 
duced double-precision throughput and that optimization was done for AMD 
GPUs. The authors of [72] have implemented both CUDA and OpenCL versions 
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of their staggered fermions code and they report a slightly higher performance 
for CUDA and for NVIDIA cards. 

4.3 Summary 

All in all, lattice QFT using GPUs is coming out of being a promising technol- 
ogy into a very viable alternative to traditional CPU-based computing. When 
reaching for the very best strong scaling performance - meaning best perfor- 
mance for small lattices - single threaded performance does matter if we assume 
that the rest of the system scales to remove other bottlenecks (communication, 
memory bandwith.) In these cases it seems that currently the best performance 
is achievable through high-end supercomputers such as the IBM Blue Gene/Q 
[83 , where actually the microprocessor architecture is starting to resemble more 
a GPU than a traditional CPU: the PowerPC A2 chip has 16 in-order cores, each 
supporting 4 relatively light weight threads and a crossbar on-chip network. A 
17th core runs the OS functions and 18th core is a spare to improve yields or 
take place of a damaged core. This design gives the PowerPC A2 chip similar 
performance to power ratio as a NVIDIA Tesla 2090 GPU, making Blue Genc/Q 
computers very efficient. One of the main advantages of using GPUs (or GPU- 
likc architectures) over traditional serial processors is the increased performance 
per watt and the possibility to perform simulations on commodity hardware. 

5 Wave function methods 

The stochastic techniques based on Markov chains and the Metropolis algorithm 
showed great success in the field theory examples above. There are also many- 
body wave function methods that use the wave function as the central variable 
and use stochastic techniques for the actual numerical work. These quantum 
Monte Carlo (QMC) techniques have shown to be very powerful tools to study 
electronic structures beyond the mean-field level of for example the density func- 
tional theory. A general overview of QMC can be found from 83] ■ The simplest 
form of the QMC algorithms is the variational QMC, where a trial wave func- 
tion with free parameters is constructed and the parameters are optimized, for 
example, to minimize the total energy |85j . This simple strategy works rather 
well for various different systems, even for strongly interacting particles in an 
external magnetic field [86 . 

There has been some works porting QMC methods to GPUs. In the early 
work by Amos G. Anderson et al. [87], the overall speedup compared to CPU 
was rather modest, from three to six, even if the individual kernels were up to 30 
times faster. More recently, Kenneth P. Esler et al. [88] have ported the QMC- 
Pack simulation code to the Nvidia CUDA GPU platform. Their full-application 
speedups is typically around 10 to 15 compared to a quad-core Xeon CPU. This 
speedup is very promising and demonstrates the great potential GPU computing 
has for the QMC methods that are perhaps the computational technologies that 
are the main-stream in the future electronic structure calculations. 
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There are also many-body wave function methods that are very close to the 
quantum chemical methods. One example of these is the full configuration in- 
teraction method of chemistry that is termed the exact diagonalization (ED) in 
physics. The activities in porting the quantum chemistry approaches to GPU are 
reviewed in [8 9) , and we try to remain on the physics side of this unclear border- 
line. We omit, for example, works on the coupled cluster method on GPU [90] . 
Furthermore, also quantum mechanical transport problems are not discussed 
here [91]. 

Lattice models [92(53] are important for providing a general understanding of 
many central physical concepts like magnetism. Furthermore, realistic materials 
can be cast to a lattice model [94]. Few-site models can be calculated exactly 
using the ED method. The ED method turns out to be very efficient on the GPU 
[95] . In the simplest form of ED, the first steps is to construct the many-body 
basis and the Hamiltonian matrix in it. Then follows the most time-consuming 
part, namely the actual diagonalization of the Hamiltonian matrix. In many 
cases, one is mainly interested in the lowest eigenstate and possibly a few of 
the lowest excited states. For these, the Lanczos algorithm turns out to be very 
suitable [95] • The basic idea of the Lanczos scheme is to map the huge but 
sparse Hamiltonian matrix to a smaller and tridiagonal form in the so called 
Krylov space that is defined by the spanning vectors, obtained from the starting 
vector \xo) by acting with the Hamiltonian as H m \xo). Now, as the GPU is very 
powerful for the matrix-vector product, it is not surprising that speedups of up 
to 100 can be found compared to CPUs |95j . 

6 Outlook 

The GPU has made a definite entry into the world of computational physics: 
the appearance of studies that mention the GPU as only side note are a clear 
indicator of this, as preliminary studies using emerging technologies will always 
be done, but the true litmus test of a new technology is whether studies emerge 
where the new technology is actually used to advance science. 

From the point of view of high performance computing in computational 
physics the biggest challenge facing GPUs at the moment is scaling: in the 
strong scaling case as many levels of parallelism as possible inherent in the 
problem should be exploited in order to reach for best performance with small 
local subsystems. The basic variables of the model are almost always vectors of 
some sort, making them an ideal candidate for SIMD-type of parallelism and 
this is often achieved with CPUs with a simple compiler-flag, which instructs 
the compiler to look for opportunities to combine independent instructions into 
vector-operations. 

Furthermore, large and therefore interesting problems from a HPC point 
of view are typically composed of a large number of similar variables, be it 
particles, field values, cells or just entries in an array of numbers, which hints at 
another, higher level of parallelism of the problem that traditionally has been 
exploited using MPI, but is a prime candidate for a data-parallel algorithm. Also, 
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algorithmic changes may be necessary to reach the best possible performance: 
it may very well be that the best algorithm for CPUs is no longer the best one 
for CPUs; A classic example could be the question whether to use lookup tables 
of certain variables, or recompute them on-the-fly - typically on the GPU the 
FLOPS arc cheap making the recomputation an attractive choice whereas the 
large caches of the CPU may make the lookup table a better option. 

On the other hand, MPI-communication latencies should be minimized and 
bandwidth increased to accommodate the faster local solve to help with both 
weak and strong scaling. As far as we know there are very few, if any, groups 
taking advantage of GPUDirect v. 2 for NVIDIA GPUs [96], which allows direct 
GPU-to-GPU MPI communications, reducing overhead and CPU synchroniza- 
tion needs. Even GPUDirect v.l helps, as then one can share the pinned memory 
buffers between Infiniband and GPU cards, removing the need to do extra local 
copies of data. The MPI implementations should also be scaled to fit the needs 
of the GPUs connected to the node. 

Another, perhaps an even greater challenge, facing GPUs and similar sys- 
tems is the ecosystem: Currently a large portion of the developers and system 
administrators like to think of GPUs and similar solutions as accelerators - an 
accelerator is a component, which is attached to the main processor and used 
to speed up certain portions of the code, but as these "accelerators" become 
more and more agile with wider support for standard algorithms, the term be- 
comes more and more irrelevant as a major part of the entire computation can 
be done on the "accelerator" and the original "brains" of the machine, the CPU, 
is mainly left there to take care of administrative functions, such as disk 10 and 
common OS services. 

As single threaded performance has reached a local limit, all types of proces- 
sors are seeking more performance out of parallelism: more cores are added and 
vector units are broadened. This trend, fueled by the fact that transistor feature 
sizes keep on shrinking, hints at some type of convergence in the near future, but 
exactly what it will look like is anyone's best guess. At least in computational 
physics, it has been shown already that the scientists are willing to take extra 
effort in porting their code to take advantage of massively parallel architectures, 
which should allow them to do the same work with less energy and do more 
science with the resources allocated to them. 

The initial programming effort does raise a concern for productivity: How 
much time and effort is one willing to spend to gain a certain amount of added 
performance? Obviously, the answer depends on the problem itself, but perhaps 
even more on the assumed direction of the industry - a wrong choice may result 
in wasted effort if the chosen solution simply does not exist in five years time 
and this may depend strongly on consumer behavior, since a large part of the 
development costs of these machines are actually subsidized by the development 
of the consumer variants of the products. Designing a processor only for the 
HPC market is too expensive and a successful product will need a sister or at 
least a cousin in the consumer market. Which brings us back to DOOM: it may 



17 



very well be that the technology developed for the gamers of today, will be the 
programming platform for the scientists of tomorrow. 
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